4.4. Compressing Backup Data
Now that the key generation part is out of the way, let’s write the
tool that generates the backup itself. You’ll be writing the code for
this in a file called azbackup.py.
Users will pass in directories to back up to this little tool. You
have several ways of dealing with this input. One valid technique is to
encrypt every file separately. However, this quickly becomes a hassle,
especially if you have thousands of files to deal with. Thankfully, the
Unix world has had experience in doing this sort of thing for a few
decades.
Note: One of the earliest references to tar is from Seventh Edition Unix in 1979.
tar is a descendant of the tap program that shipped with First Edition Unix in 1971.
Backups are typically done in a two-step process. The first step
is to gather all the input files into a single file, typically with the
extension .tar. The actual file
format is very straightforward: the files are concatenated together with
a short header before each one.
Managing a single file makes your life easier because it enables
you to manipulate this with a slew of standard tools. Authoring scripts
becomes a lot easier. Compressing the file gives much better results
(because the compression algorithm has a much better chance of finding
patterns and redundancies). The canonical way to create a tar file out
of a directory and then compress the output using the gzip algorithm is
with the following command:
tar -cvf - inputdirectory | gzip > output.tar.gz
Or you could use the shortcut version:
tar -cvzf output.tar.gz inputdirectory
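The claim that one combined file compresses better than many separate files can be checked with a small experiment. The following is only a sketch: it uses Python's zlib module (the same DEFLATE algorithm that gzip uses), and the file contents are made-up stand-ins for a directory of small, similar files.

```python
import zlib

# Hypothetical stand-ins for many small, similar files (for example, config files).
files = [
    ("name=service%d\nhost=example.internal\nport=8080\nretries=3\n" % i).encode("ascii")
    for i in range(200)
]

# Compress every file on its own and add up the compressed sizes.
separate_total = sum(len(zlib.compress(data)) for data in files)

# Concatenate first (roughly what tar does, minus its headers), then compress once.
combined_total = len(zlib.compress(b"".join(files)))

print("compressed separately:", separate_total)
print("compressed together:  ", combined_total)
```

The combined figure comes out far smaller because the compressor can reuse patterns it has already seen in earlier files, and because the fixed per-stream overhead is paid only once.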
To get back the original contents, the same process is done in
reverse. The tarred, gzipped file is decompressed, and then split apart
into its constituent files.
Example 5 shows the
code to do all of this inside azbackup.py. There are two symmetric functions: generate_tar_gzip and extract_tar_gzip. The former takes a directory
or a file to compress, and writes out a tarred, gzipped archive to a
specified output filename. The latter does the reverse: it takes an input
archive and extracts its contents to a specified directory. The code
relies on the tarfile module that ships with Python, which supports
both tar archives and gzip compression.
Example 5. Compressing and extracting archives
import os
import tarfile

def generate_tar_gzip(directory_or_file, output_file_name):
    # Strip any trailing slash; otherwise os.path.basename() below
    # would return an empty string
    if directory_or_file.endswith("/"):
        directory_or_file = directory_or_file.rstrip("/")

    # We open a handle to an output tarfile. The 'w:gz'
    # specifies that we're writing to it and that gzip
    # compression should be used
    out = tarfile.open(output_file_name, "w:gz")

    # Add the input directory to the tarfile. Arcname is the
    # name of the directory 'inside' the archive. We'll just reuse the name
    # of the file/directory here
    out.add(directory_or_file,
            arcname=os.path.basename(directory_or_file))
    out.close()

def extract_tar_gzip(archive_file_name, output_directory):
    # Open the tar file and extract all contents to the
    # output directory
    extract = tarfile.open(archive_file_name)
    extract.extractall(output_directory)
    extract.close()
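To see what arcname does and to confirm that archiving and extraction are symmetric, here is a small round-trip sketch. It is written for Python 3 and calls the tarfile module directly rather than the two functions above; the directory name photos and the file a.txt are made up for illustration.

```python
import os
import tarfile
import tempfile

# Build a throwaway input directory (hypothetical names).
work = tempfile.mkdtemp()
src = os.path.join(work, "photos")
os.mkdir(src)
with open(os.path.join(src, "a.txt"), "w") as f:
    f.write("hello backup\n")

# Archive it, storing entries under the directory's base name.
archive = os.path.join(work, "photos.tar.gz")
with tarfile.open(archive, "w:gz") as out:
    out.add(src, arcname=os.path.basename(src))

# List the names stored inside the archive.
with tarfile.open(archive) as tar:
    names = sorted(tar.getnames())
print(names)

# Extract into a fresh directory and read the file back.
restore = os.path.join(work, "restore")
os.mkdir(restore)
with tarfile.open(archive) as tar:
    tar.extractall(restore)
with open(os.path.join(restore, "photos", "a.txt")) as f:
    restored = f.read()
```

Because of arcname, the archive stores photos and photos/a.txt rather than the full temporary path. Note that extractall trusts the paths stored in the archive, so it should be used only on archives you created yourself.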
4.5. Encrypting Data
azbackup will use the following three-step process to encrypt data (with
“data” here being the compressed archives generated from the previous
step):
1. For every archive, it’ll generate a unique 256-bit key. Let’s call
this key Ksym.
2. Ksym is used to encrypt the archive using AES-256 in CBC mode. (You’ll
learn what “CBC” means in just a bit.)
3. Ksym is encrypted with the user’s RSA encryption key (Kenc) and attached
to the encrypted data from the previous step.
Example 6 shows the code in the
crypto module corresponding to the previously described three
steps.
Example 6. Encrypting data
# Imports needed by this listing. io.BytesIO and the plain integer
# arithmetic below are used so the code runs on both Python 2.6+ and 3.
import random
from binascii import unhexlify
from io import BytesIO

from M2Crypto import EVP, RSA

def generate_rand_bits(bits=32*8):
    """SystemRandom is a cryptographically strong source of randomness
    Get n bits of randomness"""
    sys_random = random.SystemRandom()
    return long_as_bytes(sys_random.getrandbits(bits), bits // 8)

def long_as_bytes(lvalue, width):
    """This rather dense piece of code takes a long and splits it apart
    into a byte array containing its constituent bytes with most
    significant byte first"""
    fmt = '%%.%dx' % (2 * width)
    return unhexlify(fmt % (lvalue & ((1 << 8 * width) - 1)))

def block_encrypt(data, key):
    """ High level function which takes data and key as parameters
    and turns it into IV + CipherText after padding. Note that this
    still needs a sig added at the end"""
    # AES has a 128-bit block, so a CBC IV needs only 16 bytes;
    # this listing generates 32 bytes and prepends all of them
    iv = generate_rand_bits(32 * 8)
    ciphertext = aes256_encrypt_data(data, key, iv)
    return iv + ciphertext

def aes256_encrypt_data(data, key, iv):
    """ Takes data, a 256-bit key and an IV and encrypts it.
    Encryption is done with AES 256 in CBC mode. Note that OpenSSL
    is doing the padding for us"""
    enc = 1  # 1 means encrypt, 0 means decrypt
    cipher = EVP.Cipher('aes_256_cbc', key, iv, enc, 0)
    pbuf = BytesIO(data)
    cbuf = BytesIO()
    ciphertext = aes256_stream_helper(cipher, pbuf, cbuf)
    pbuf.close()
    cbuf.close()
    return ciphertext

def aes256_stream_helper(cipher, input_stream, output_stream):
    # Feed the plaintext through the cipher and collect the ciphertext
    while True:
        buf = input_stream.read()
        if not buf:
            break
        output_stream.write(cipher.update(buf))
    output_stream.write(cipher.final())
    return output_stream.getvalue()

def encrypt_rsa(rsa_key, data):
    return rsa_key.public_encrypt(data, RSA.pkcs1_padding)
That was quite a bit of code, so let’s break down what this code
does.
4.5.1. Generating a unique Ksym
The work of generating a random, unique key is done by
generate_rand_bits. This takes the number of
bits to generate as a parameter. In this case, it’ll be called with
256 because you are using AES-256. You call through to Python’s
random.SystemRandom to get a
cryptographically strong random number.
Note: It is important to use this rather than the default generator in
the random module (a Mersenne Twister). Cryptographically strong
random-number generators have a number of important security
properties that make them difficult to predict. Using the default
generator creates an immediate security vulnerability because an
attacker who observes enough of its output can predict the key and
decrypt data. As you can imagine, this is a common mistake, made even by
reputable software vendors.
Where does this cryptographically strong random-number generator
come from? In this case, Python lets the operating system do the heavy
lifting. On Unix this will read from /dev/urandom, while on Windows this will
call CryptGenRandom. These are both
valid (and, in fact, the recommended) means of getting good random
numbers.
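As a rough illustration of the above, the following Python 3 sketch draws a 256-bit key from the operating system in two equivalent ways. The function name random_key_bytes is made up for this example; int.to_bytes with "big" byte order produces the same bytes as the hex-based long_as_bytes helper in Example 6.

```python
import os
import random

def random_key_bytes(bits=256):
    """Return bits/8 random bytes drawn from the operating system's generator."""
    sys_random = random.SystemRandom()
    value = sys_random.getrandbits(bits)
    # Fixed-width conversion, most significant byte first.
    return value.to_bytes(bits // 8, "big")

key = random_key_bytes(256)

# os.urandom is the same OS-level source that SystemRandom builds on,
# and returns bytes directly.
other_key = os.urandom(32)

print(len(key), len(other_key))
```

Either approach is reasonable; os.urandom simply skips the integer-to-bytes conversion step.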
4.5.2. Encrypting using AES-256
After generating a unique Ksym, the next step
is to encrypt data using AES-256. The “256” here refers to the key size. AES is
a block cipher: it takes a fixed-size block of plaintext (128 bits
for AES, regardless of key size) and a key (256 bits, in this case), and then
converts the block into ciphertext of the same length. The obvious
problem here is that the data is somewhat longer than 128 bits.
Not surprisingly, there are several mechanisms to deal with
this, and they are called modes of operation. In
this particular case, the chosen mode is cipher block chaining (CBC). Figure 4 shows how this mode works. The incoming
data (plaintext) is split into block-size chunks. Each block of
plaintext is XORed with the
previous ciphertext block before being encrypted.
Why not just encrypt every block separately and concatenate all
the ciphertext?
This is what the Electronic Codebook
(ECB) mode does, and it is very insecure.
Typically, input data will have lots of well-known structures (file
format headers, whitespace, and so on). Since identical plaintext blocks
always encrypt to the same ciphertext under a given key, the attacker can look for repeating forms
(the encrypted versions of the aforementioned structure) and glean
information about the data. CBC (the technique used here) prevents
this attack because the encrypted form of every block also depends on
the blocks that come before it.
This still leaves some vulnerability. Since the first few blocks
of data can be the same from one archive to the next, the attacker can spot patterns in the
beginning of the data stream. To avoid this, CBC mode takes an
initialization vector
(IV). This is a block filled with random data
that the cipher will use as the “starting block.” This ensures that
any pattern in the beginning of the input data is undetectable.
This data doesn’t need to be secret and, in fact, is usually
added to the encrypted data in some form so that the receiver knows
what IV was used. However, it does need to be different for each
archive, and it must never be reused with the same key. In this sample
code, you generate IVs the same way you generate random keys: by
making a call to generate_rand_bits.
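The difference between ECB and CBC, and the effect of the IV, can be seen in a few lines of Python 3. This is only an illustration: Python's standard library has no AES, so the sketch substitutes a toy 16-byte Feistel cipher built from HMAC (toy_encrypt_block, a made-up helper that is not AES and not secure). The chaining logic around it is the part that matters, and padding is skipped.

```python
import hashlib
import hmac
import os

BLOCK = 16  # bytes; the same block size as AES, but this is NOT AES

def toy_encrypt_block(key, block):
    """Toy 4-round Feistel cipher, for illustration only (not secure)."""
    left, right = block[:8], block[8:]
    for rnd in range(4):
        f = hmac.new(key, bytes([rnd]) + right, hashlib.sha256).digest()[:8]
        left, right = right, bytes(a ^ b for a, b in zip(left, f))
    return left + right

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split_blocks(data):
    assert len(data) % BLOCK == 0, "this sketch skips padding"
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def ecb_encrypt(key, plaintext):
    # Every block is encrypted independently of the others.
    return b"".join(toy_encrypt_block(key, b) for b in split_blocks(plaintext))

def cbc_encrypt(key, iv, plaintext):
    # Each block is XORed with the previous ciphertext block (the IV for
    # the first block) before being encrypted.
    previous, out = iv, []
    for block in split_blocks(plaintext):
        previous = toy_encrypt_block(key, xor(block, previous))
        out.append(previous)
    return b"".join(out)

key = os.urandom(32)
plaintext = b"A" * BLOCK * 3  # three identical plaintext blocks

ecb = split_blocks(ecb_encrypt(key, plaintext))
cbc = split_blocks(cbc_encrypt(key, os.urandom(BLOCK), plaintext))
cbc_again = split_blocks(cbc_encrypt(key, os.urandom(BLOCK), plaintext))

print("ECB blocks identical:", ecb[0] == ecb[1] == ecb[2])
print("CBC blocks identical:", cbc[0] == cbc[1] == cbc[2])
print("Same data, new IV, same first block:", cbc[0] == cbc_again[0])
```

In ECB the three identical plaintext blocks produce three identical ciphertext blocks, which is exactly the pattern an attacker looks for. In CBC they differ, and encrypting the same data again under a fresh IV gives a different ciphertext from the very first block onward (barring a vanishingly unlikely IV collision).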
Note: Reusing the same IV is typically a “no-no.” Bad usage of IVs
is a core reason Wired Equivalent Privacy (WEP) is considered
insecure.
The core of the encryption work is done in aes256_encrypt_data. This takes the input
plaintext, Ksym, and a unique
IV. It creates an instance of the EVP.Cipher class and specifies that it wants
to use AES-256 in CBC mode. The EVP.Cipher class is best used in a streaming
mode. The little helper method aes256_stream_helper does exactly this. It
takes the cipher, the input data stream, and an output stream as
parameters. It writes data into the cipher object, and reads the
ciphertext into the output stream.
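The helper in Example 6 calls read() with no size argument, which pulls the whole input into memory in one go. For large archives, a chunked loop keeps memory use bounded. The Python 3 sketch below shows the pattern with a made-up stand-in class, UppercaseCipher, that merely exposes the same update()/final() shape as a streaming cipher object; it performs no encryption, and the function name stream_through and the chunk_size parameter are likewise assumptions for this example.

```python
import io

class UppercaseCipher(object):
    """Hypothetical stand-in with update()/final(); it only upper-cases bytes."""

    def update(self, data):
        return data.upper()

    def final(self):
        # A real cipher would return the last padded block here.
        return b""

def stream_through(cipher, input_stream, output_stream, chunk_size=4096):
    """Push input through the cipher one chunk at a time."""
    while True:
        buf = input_stream.read(chunk_size)
        if not buf:
            break
        output_stream.write(cipher.update(buf))
    output_stream.write(cipher.final())

source = io.BytesIO(b"backup data " * 1000)
sink = io.BytesIO()
stream_through(UppercaseCipher(), source, sink, chunk_size=64)
result = sink.getvalue()
print(len(result))
```

With file objects in place of the in-memory streams, the same loop would process an archive without ever holding all of it in memory.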
Note: Again, these techniques can be used on any major programming
platform. In .NET, AES is supported through the System.Security.Cryptography.Rijndael
class.
All this is wrapped up by block_encrypt, which makes the actual call
to generate the IV, encrypts the incoming data, and then returns
the IV followed by the encrypted data.
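Because block_encrypt puts the IV in front of the ciphertext, whoever decrypts the data has to split the two apart again. A minimal sketch of that step follows; the function name split_iv_and_ciphertext is made up for this example, and the 32-byte length simply mirrors the generate_rand_bits(32 * 8) call in block_encrypt.

```python
IV_LENGTH = 32  # bytes; mirrors generate_rand_bits(32 * 8) in block_encrypt

def split_iv_and_ciphertext(blob):
    """Split the output of block_encrypt back into (iv, ciphertext)."""
    if len(blob) < IV_LENGTH:
        raise ValueError("input is too short to contain an IV")
    return blob[:IV_LENGTH], blob[IV_LENGTH:]

# Dummy bytes standing in for a real IV and real ciphertext.
blob = b"\x01" * IV_LENGTH + b"encrypted-bytes"
iv, ciphertext = split_iv_and_ciphertext(blob)
print(len(iv), ciphertext)
```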
4.5.3. Encrypting Ksym using Kenc
The final step is to encrypt Ksym using Kenc. Since this is
an RSA key pair, this encryption is done with the public key portion.
RSA is sensitive to the size and structure of the input data, so the
encryption is done using a well-known padding scheme supported by
OpenSSL.
The actual encryption is done by encrypt_rsa. This takes an RSA key pair as a
parameter (which, in this case, is a type in the M2Crypto package) and
calls a method on that object to encrypt the input data.
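The size sensitivity mentioned above can be made concrete. With PKCS #1 v1.5 padding (which is what RSA.pkcs1_padding selects), at least 11 bytes of every RSA block are taken up by padding, so the largest message is the modulus size in bytes minus 11. The sketch below works through the arithmetic; the key sizes are assumptions chosen for illustration, and max_rsa_plaintext_bytes is a made-up helper name.

```python
PKCS1_V15_OVERHEAD = 11  # minimum padding bytes for PKCS #1 v1.5 encryption

def max_rsa_plaintext_bytes(modulus_bits):
    """Largest message, in bytes, that fits in one padded RSA block."""
    modulus_bytes = (modulus_bits + 7) // 8
    return modulus_bytes - PKCS1_V15_OVERHEAD

KSYM_BYTES = 32  # a 256-bit symmetric key

for bits in (1024, 2048, 4096):  # hypothetical RSA key sizes
    limit = max_rsa_plaintext_bytes(bits)
    print(bits, limit, KSYM_BYTES <= limit)
```

A 32-byte Ksym fits comfortably in all three cases, which is why RSA is used here only to wrap the small symmetric key and not the archive itself.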
Note: The fact that you’re using only the public key portion to
encrypt is significant. Though no support for this was added as of
this writing, a different key format could be implemented that
separates the public key from the private key, and puts them in
different files. The encryption code would then need access only to the
public key, and thus could run on insecure machines.
At the end of this process, you now have encrypted data and an
encrypted key in a large byte array that can then be uploaded to the
cloud.